Search | VHL Regional Portal

Species-aware DNA language models capture regulatory elements and their evolution.

Karollus, Alexander; Hingerl, Johannes; Gankin, Dennis; Grosshauser, Martin; Klemon, Kristian; Gagneur, Julien.

Genome Biol ; 25(1): 83, 2024 Apr 02.

Article in English | MEDLINE | ID: mdl-38566111

ABSTRACT

BACKGROUND: The rise of large-scale multi-species genome sequencing projects promises to shed new light on how genomes encode gene regulatory instructions. To this end, new algorithms are needed that can leverage conservation to capture regulatory elements while accounting for their evolution. RESULTS: Here, we introduce species-aware DNA language models, which we trained on more than 800 species spanning over 500 million years of evolution. Investigating their ability to predict masked nucleotides from context, we show that DNA language models distinguish transcription factor and RNA-binding protein motifs from background non-coding sequence. Owing to their flexibility, DNA language models capture conserved regulatory elements over much further evolutionary distances than sequence alignment would allow. Remarkably, DNA language models reconstruct motif instances bound in vivo better than unbound ones and account for the evolution of motif sequences and their positional constraints, showing that these models capture functional high-order sequence and evolutionary context. We further show that species-aware training yields improved sequence representations for endogenous and MPRA-based gene expression prediction, as well as motif discovery. CONCLUSIONS: Collectively, these results demonstrate that species-aware DNA language models are a powerful, flexible, and scalable tool to integrate information from large compendia of highly diverged genomes.

Subject(s)

DNA; Regulatory Sequences, Nucleic Acid; Binding Sites; Sequence Alignment; Algorithms; Conserved Sequence/genetics; Evolution, Molecular
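
As a rough illustration of the masked-nucleotide task mentioned in the abstract above, the sketch below uses a simple count-based context model as a stand-in for a DNA language model. It is not the authors' method: the function names, the flank size, and the toy sequences are all hypothetical, and a real language model would learn from much longer contexts and many genomes.

```python
from collections import Counter, defaultdict


def train_context_model(sequences, flank=2):
    """Count which nucleotide appears between each (left, right) flank pair."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for i in range(flank, len(seq) - flank):
            context = (seq[i - flank:i], seq[i + 1:i + 1 + flank])
            counts[context][seq[i]] += 1
    return counts


def predict_masked(counts, seq, pos, flank=2):
    """Return a probability for each nucleotide at the masked position `pos`.

    `pos` must leave at least `flank` nucleotides on each side.
    """
    context = (seq[pos - flank:pos], seq[pos + 1:pos + 1 + flank])
    seen = counts.get(context)
    if not seen:
        # Unseen context: fall back to a uniform guess.
        return {base: 0.25 for base in "ACGT"}
    total = sum(seen.values())
    return {base: seen[base] / total for base in "ACGT"}


# Toy training data (hypothetical): the same flanks around a variable centre.
training = ["AATGACTCATT"] * 3 + ["CCTGAGTCACC"]
model = train_context_model(training)

# Position 5 is masked ("N"); its flanks are "GA" and "TC".
probs = predict_masked(model, "GGTGANTCAGG", pos=5)
print(probs)
```

A position inside a conserved motif would tend to get a confident prediction from its context, while background sequence would stay close to uniform; that contrast is the intuition the toy model is meant to convey.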

Viral genome sequencing to decipher in-hospital SARS-CoV-2 transmission events.

Esser, Elisabeth; Schulte, Eva C; Graf, Alexander; Karollus, Alexander; Smith, Nicholas H; Michler, Thomas; Dvoretskii, Stefan; Angelov, Angel; Sonnabend, Michael; Peter, Silke; Engesser, Christina; Radonic, Aleksandar; Thürmer, Andrea; von Kleist, Max; Gebhardt, Friedemann; da Costa, Clarissa Prazeres; Busch, Dirk H; Muenchhoff, Maximilian; Blum, Helmut; Keppler, Oliver T; Gagneur, Julien; Protzer, Ulrike.

Sci Rep ; 14(1): 5768, 2024 03 08.

Article in English | MEDLINE | ID: mdl-38459123

ABSTRACT

The SARS-CoV-2 pandemic has highlighted the need to better define in-hospital transmissions, a need that extends to all other common infectious diseases encountered in clinical settings. To evaluate how whole viral genome sequencing can contribute to deciphering nosocomial SARS-CoV-2 transmission, 926 SARS-CoV-2 viral genomes from 622 staff members and patients were collected between February 2020 and January 2021 at a university hospital in Munich, Germany, and analysed along with the place of work, duration of hospital stay, and ward transfers. Bioinformatically defined transmission clusters inferred from viral genome sequencing were compared to those inferred from interview-based contact tracing. An additional dataset collected at the same time at another university hospital in the same city was used to account for multiple independent introductions. Clustering analysis of 619 viral genomes generated 19 clusters ranging from 3 to 31 individuals. Sequencing-based transmission clusters showed little overlap with those based on contact tracing data. The viral genomes were significantly more closely related to each other than comparable genomes collected simultaneously at other hospitals in the same city (n = 829), suggesting nosocomial transmission. Longitudinal sampling from individual patients suggested possible cross-infection events during the hospital stay in 19.2% of individuals (14 of 73 individuals). Clustering analysis of SARS-CoV-2 whole genome sequences can reveal cryptic transmission events missed by classical, interview-based contact tracing, helping to decipher in-hospital transmissions. These results, in line with other studies, advocate for viral genome sequencing as a pathogen transmission surveillance tool in hospitals.

Subject(s)

COVID-19; Cross Infection; Humans; SARS-CoV-2/genetics; COVID-19/epidemiology; COVID-19/genetics; Genome, Viral/genetics; Cross Infection/epidemiology; Cross Infection/genetics; Hospitals, University

Current sequence-based models capture gene expression determinants in promoters but mostly ignore distal enhancers.

Karollus, Alexander; Mauermeier, Thomas; Gagneur, Julien.

Genome Biol ; 24(1): 56, 2023 03 27.

Article in English | MEDLINE | ID: mdl-36973806

ABSTRACT

BACKGROUND: The largest sequence-based models of transcription control to date are obtained by predicting genome-wide gene regulatory assays across the human genome. This setting is fundamentally correlative, as those models are exposed during training solely to the sequence variation between human genes that arose through evolution, questioning the extent to which those models capture genuine causal signals. RESULTS: Here we confront predictions of state-of-the-art models of transcription regulation against data from two large-scale observational studies and five deep perturbation assays. The most advanced of these sequence-based models, Enformer, by and large, captures causal determinants of human promoters. However, models fail to capture the causal effects of enhancers on expression, notably in medium to long distances and particularly for highly expressed promoters. More generally, the predicted impact of distal elements on gene expression predictions is small and the ability to correctly integrate long-range information is significantly more limited than the receptive fields of the models suggest. This is likely caused by the escalating class imbalance between actual and candidate regulatory elements as distance increases. CONCLUSIONS: Our results suggest that sequence-based models have advanced to the point that in silico study of promoter regions and promoter variants can provide meaningful insights and we provide practical guidance on how to use them. Moreover, we foresee that it will require significantly more and particularly new kinds of data to train models accurately accounting for distal elements.

Subject(s)

Enhancer Elements, Genetic; Genomics; Humans; Genomics/methods; Promoter Regions, Genetic; Gene Expression Regulation; Gene Expression

The adapted Activity-By-Contact model for enhancer-gene assignment and its application to single-cell data.

Hecker, Dennis; Behjati Ardakani, Fatemeh; Karollus, Alexander; Gagneur, Julien; Schulz, Marcel H.

Bioinformatics ; 39(2), 2023 02 03.

Article in English | MEDLINE | ID: mdl-36708003

ABSTRACT

MOTIVATION: Identifying regulatory regions in the genome is of great interest for understanding the epigenomic landscape in cells. One fundamental challenge in this context is to find the target genes whose expression is affected by the regulatory regions. A recent successful method is the Activity-By-Contact (ABC) model, which scores enhancer-gene interactions based on enhancer activity and the contact frequency of an enhancer to its target gene. However, it describes regulatory interactions entirely from a gene's perspective and does not account for all the candidate target genes of an enhancer. In addition, the ABC model requires two types of assays to measure enhancer activity, which limits its applicability. Moreover, no available implementation allows integration with transcription factor (TF) binding information or efficient analysis of single-cell data. RESULTS: We demonstrate that the ABC score can yield a higher accuracy by adapting the enhancer activity according to the number of contacts the enhancer has to its candidate target genes and also by considering all annotated transcription start sites of a gene. Further, we show that the model is comparably accurate with only one assay to measure enhancer activity. We combined our generalized ABC model with TF binding information and illustrated an analysis of a single-cell ATAC-seq dataset of the human heart, where we were able to characterize cell type-specific regulatory interactions and predict gene expression based on TF affinities. All executed processing steps are incorporated into our new computational pipeline STARE. AVAILABILITY AND IMPLEMENTATION: The software is available at https://github.com/schulzlab/STARE. CONTACT: marcel.schulz@em.uni-frankfurt.de. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Subject(s)

Gene Expression Regulation; Transcription Factors; Humans; Transcription Factors/metabolism; Regulatory Sequences, Nucleic Acid; Software; Protein Binding
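
The ABC scoring idea described in the abstract above can be sketched as follows. The classic form, as it is commonly stated, divides an enhancer's activity times its contact frequency with a gene by the sum of that product over all candidate enhancers of the gene. The adapted variant shown here is only one possible reading of "adapting the enhancer activity according to the number of contacts the enhancer has to its candidate target genes"; the function names and toy numbers are hypothetical, and the STARE repository remains the authoritative implementation.

```python
def abc_scores(activity, contact):
    """Classic ABC: score[e][g] = A_e * C_eg / sum over e' of (A_e' * C_e'g)."""
    n_enh, n_genes = len(contact), len(contact[0])
    scores = [[0.0] * n_genes for _ in range(n_enh)]
    for g in range(n_genes):
        products = [activity[e] * contact[e][g] for e in range(n_enh)]
        total = sum(products)
        for e in range(n_enh):
            scores[e][g] = products[e] / total if total > 0 else 0.0
    return scores


def adapted_abc_scores(activity, contact):
    """Assumed adaptation: give each gene a share of the enhancer's activity
    proportional to that gene's fraction of the enhancer's total contacts."""
    n_enh, n_genes = len(contact), len(contact[0])
    scores = [[0.0] * n_genes for _ in range(n_enh)]
    row_totals = [sum(row) for row in contact]
    for g in range(n_genes):
        products = []
        for e in range(n_enh):
            share = contact[e][g] / row_totals[e] if row_totals[e] > 0 else 0.0
            products.append(activity[e] * share * contact[e][g])
        total = sum(products)
        for e in range(n_enh):
            scores[e][g] = products[e] / total if total > 0 else 0.0
    return scores


# Toy input (hypothetical): 2 enhancers (rows) x 2 genes (columns).
activity = [10.0, 5.0]
contact = [[0.5, 0.5],
           [1.0, 0.0]]

classic = abc_scores(activity, contact)
adapted = adapted_abc_scores(activity, contact)
print(classic)
print(adapted)
```

In this toy case the first enhancer splits its contacts between two genes, so the adapted variant lowers its score for the first gene relative to the second enhancer, which contacts only that gene.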

Predicting mean ribosome load for 5'UTR of any length using deep learning.

Karollus, Alexander; Avsec, Ziga; Gagneur, Julien.

PLoS Comput Biol ; 17(5): e1008982, 2021 05.

Article in English | MEDLINE | ID: mdl-33970899

ABSTRACT

The 5' untranslated region plays a key role in regulating mRNA translation and consequently protein abundance. Therefore, accurate modeling of 5'UTR regulatory sequences should provide insights into translational control mechanisms and help interpret genetic variants. Recently, a model was trained on a massively parallel reporter assay to predict mean ribosome load (MRL), a proxy for translation rate, directly from 5'UTR sequence with a high degree of accuracy. However, this model is restricted to sequence lengths investigated in the reporter assay and therefore cannot be applied to the majority of human sequences without a substantial loss of information. Here, we introduced frame pooling, a novel neural network operation that enabled the development of an MRL prediction model for 5'UTRs of any length. Our model shows state-of-the-art performance on fixed length randomized sequences, while offering better generalization performance on longer sequences and on a variety of translation-related genome-wide datasets. Variant interpretation is demonstrated on a 5'UTR variant of the gene HBB associated with beta-thalassemia. Frame pooling could find applications in other bioinformatics predictive tasks. Moreover, our model, released open source, could help pinpoint pathogenic genetic variants.

Subject(s)

5' Untranslated Regions; Deep Learning; Ribosomes/metabolism; Humans; RNA, Messenger/genetics
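
The abstract above does not define frame pooling, so the following is a minimal sketch under a stated assumption: positions are grouped by reading frame, counted from the 3' end of the 5'UTR, and each group is pooled globally (max and mean). This gives an output of fixed size for an input of any length, which is the property the abstract highlights. The function name and the choice of pooling statistics are hypothetical and may differ from the published model.

```python
def frame_pool(features):
    """Pool per-position feature vectors (ordered 5' to 3') by reading frame.

    Returns a flat list of length 3 * n_channels * 2: for each frame and
    channel, the max followed by the mean over positions in that frame.
    """
    if not features:
        raise ValueError("need at least one position")
    length = len(features)
    n_channels = len(features[0])
    pooled = []
    for frame in range(3):
        # Assumption: frame is the distance to the 3' end, modulo 3.
        rows = [features[i] for i in range(length)
                if (length - 1 - i) % 3 == frame]
        for c in range(n_channels):
            column = [row[c] for row in rows]
            if column:
                pooled.extend([max(column), sum(column) / len(column)])
            else:
                # Inputs shorter than 3 positions leave some frames empty.
                pooled.extend([0.0, 0.0])
    return pooled


# Two inputs of different length (toy values) give outputs of the same size.
short = frame_pool([[1.0], [2.0], [3.0], [4.0]])
longer = frame_pool([[float(i)] for i in range(10)])
print(short, len(short), len(longer))
```

Because the output size depends only on the number of channels, a dense layer can follow this operation regardless of how long the input sequence is.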
